This review describes a meta-analysis of findings from 50 controlled evaluations
of intelligent computer tutoring systems. The median effect of intelligent
tutoring in the 50 evaluations was to raise test scores 0.66 standard deviations
over conventional levels, or from the 50th to the 75th percentile.
However,  the  amount  of  improvement  found  in  an  evaluation  depended  to  a 
great  extent  on  whether  improvement  was  measured  on  locally  developed  or 
standardized tests, suggesting that alignment of test and instructional objectives
is a critical determinant of evaluation results. The review also describes
findings  from  two  groups  of  evaluations  that  did  not  meet  all  of  the  selection 
requirements for the meta-analysis: six evaluations with nonconventional
control groups and four with flawed implementations of intelligent tutoring
systems. Intelligent tutoring effects in these evaluations were small, suggesting
that evaluation results are also affected by the nature of control treatments
and the adequacy of program implementations.

Keywords: intelligent tutoring systems, computer-assisted instruction,

Computer tutoring is a late development in the long history of tutoring in education.
Whereas human tutoring has been used in schools for 2,500 years, or for
as long as schools have existed, computer tutoring is largely a product of the past
half  century.  The  first  computer  tutoring  systems  to  be  used  in  school  classrooms 
(e.g.,  R.  C.  Atkinson,  1968;  Suppes  &  Morningstar,  1969)  showed  the  influence 
of  the  programmed  instruction  movement  of  the  time:  They  presented  instruction 
in  short  segments  or  frames,  asked  questions  frequently  during  instruction,  and 
provided immediate feedback on answers (Crowder, 1959; Skinner, 1958). A different
type of computer tutoring system appeared in research laboratories and
classrooms during the 1970s and 1980s (e.g., Carbonell, 1970; Fletcher, 1985;
Sleeman & Brown, 1982). Grounded in artificial intelligence concepts and cognitive
theory, these newer systems guided learners through each step of a problem
solution by creating hints and feedback as needed from expert-knowledge databases.
The first-generation computer tutors have been given the retronym CAI
tutors (for computer-assisted instruction tutors); the second-generation tutors are
usually called intelligent tutoring systems, or ITSs (VanLehn, 2011).

VanLehn  (2011)  has  summarized  common  beliefs  about  the  effectiveness  of 
different  types  of  tutoring.  According  to  VanLehn,  CAI  tutors  are  generally 
believed  to  boost  examination  scores  by  0.3  standard  deviations  over  usual  levels, 
or  from  the  50th  to  the  62nd  percentile.  ITSs  are  thought  to  be  more  effective, 
raising  test  performance  by  about  1  standard  deviation,  or  from  the  50th  to  the 
84th  percentile.  Human  tutors  are  thought  to  be  most  effective  of  all,  raising  test 
scores  by  2  standard  deviations,  or  from  the  50th  to  the  98th  percentile. 
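
The percentile equivalents quoted above follow from the normal curve. As a minimal sketch, assuming that control-group test scores are normally distributed, the conversion from an effect size to a percentile can be written as follows. The function name is ours and purely illustrative.

```python
import math


def percentile_from_effect_size(effect_size: float) -> float:
    """Percentile rank, within a normal control distribution, of a student
    who scores `effect_size` standard deviations above the control mean."""
    # Standard normal cumulative distribution, expressed with the error function.
    return 100.0 * 0.5 * (1.0 + math.erf(effect_size / math.sqrt(2.0)))


# Effect sizes commonly attributed to CAI tutors, ITSs, and human tutors.
for es in (0.3, 1.0, 2.0):
    print(f"{es:.1f} SD -> {percentile_from_effect_size(es):.0f}th percentile")
```

Under the normality assumption, the three values round to the 62nd, 84th, and 98th percentiles reported in the text.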

These  conventional  views  on  tutoring  effectiveness  are  based  on  research  from 
decades  ago.  VanLehn  (2011)  attributed  the  belief  that  CAI  tutors  produce  gains 
of  around  0.3  standard  deviations  to  a  meta-analytic  review  of  165  studies  (C.-L. 
C.  Kulik  &  Kulik,  1991).  He  attributed  the  belief  that  ITSs  produce  1-standard 
deviation  gains  to  a  widely  cited  article  (Anderson,  Corbett,  Koedinger,  & 
Pelletier,  1995)  that  summarized  findings  from  several  influential  studies.  The 
belief  that  human  tutors  raise  student  achievement  levels  by  2  standard  deviations 
stems from an influential article by Bloom (1984), who coined the term two-sigma
problem to denote the search for other teaching approaches that are as
effective  as  human  tutoring. 

More  recent  reviews  support  conventional  beliefs  about  CAI  tutoring  effects. 
For  example,  a  1994  review  aggregated  results  from  12  separate  meta-analyses  on 
computer-based  instruction  carried  out  at  eight  different  research  centers  (J.  A. 
Kulik,  1994).  Each  of  the  analyses  yielded  the  conclusion  that  computer-based 
instruction  improves  student  learning  to  a  moderate  degree.  The  median  effect  of 
computer-based  instruction  in  the  12  meta-analyses  was  an  increase  in  test  scores 
of  0.38  standard  deviations,  or  from  the  50th  to  the  64th  percentile.  More  recently, 
Tamim,  Bernard,  Borokhovski,  Abrami,  and  Schmid  (2011)  reviewed  results  from 
25  meta-analyses  on  instructional  technology  and  learning.  None  of  the  analyses 
covered  ITSs.  Median  effect  of  instructional  technology  in  all  25  meta-analyses 
was  an  improvement  in  test  scores  of  0.35  standard  deviations.  Median  effect  in 
the  14  analyses  that  focused  exclusively  on  CAI  or  computer-based  instruction 
was  an  improvement  of  0.26  standard  deviations.  Taken  together,  Tamim  et  al.’s 
(2011)  and  J.  A.  Kulik’s  (1994)  reviews  suggest  that  test  score  improvements  of 
around  one-third  standard  deviation  are  typical  for  studies  of  CAI  tutoring. 

It  is  much  harder  to  find  support  for  conventional  beliefs  about  effects  of 
human  tutoring.  Bloom  (1984)  based  his  claim  for  two-sigma  tutoring  effects  on 
two  studies  carried  out  by  his  graduate  students  (Anania,  1981;  Burke,  1980). 
Each  of  the  studies  compared  performance  of  a  conventionally  taught  control 
group  with  performance  of  two  mastery  learning  groups,  one  taught  with  and  one 
taught  without  the  assistance  of  trained  undergraduate  tutors.  Without  tutoring, 
the mastery system raised test scores 1.2 standard deviations above control scores.
Adding undergraduate tutors to the mastery program raised test scores an
additional 0.8 standard deviations, yielding a total improvement of 2.0 standard
deviations. This improvement is thus the combined effect of tutorial assistance
plus special mastery learning materials and procedures. Neither Anania (1981)
nor Burke (1980) evaluated the effects of tutoring alone. Because the studies confounded
mastery and tutoring treatments, it is important to look beyond them for
direct evidence on tutoring effects.

An  early  meta-analytic  review  (Hartley,  1977),  which  examined  29  studies  of 
peer  tutoring  in  elementary  and  secondary  school  mathematics,  reported  that 
tutoring programs raised math test scores by an average of 0.60 standard deviations.
P. A. Cohen, Kulik, and Kulik (1982) reported an average improvement of
0.40 standard deviations in 65 studies of peer tutoring programs in elementary and
secondary schools. Mathes and Fuchs (1994) found an improvement of 0.36 standard
deviations in 11 studies of peer tutoring in reading for students with mild
disabilities. G. W. Ritter, Barnett, Denny, and Albin (2009) examined the effectiveness
of adult tutors in elementary schools and reported that tutoring improved
student  performance  by  0.30  standard  deviations  in  24  studies.  Finally,  VanLehn 
(2011)  summarized  results  from  10  studies  of  human  tutoring,  including  Anania’s 
(1981)  study.  The  median  effect  of  human  tutoring  in  the  10  studies  was  a  test 
score  increase  of  0.79  standard  deviations.  Without  Anania’s  study,  the  median 
increase was 0.68 standard deviations. The median effect of human tutoring in the
five meta-analyses was an improvement in performance of 0.40 standard deviations,
far from Bloom’s (1984) two-sigma effect.
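
The median of 0.40 can be checked directly. The sketch below assumes that the five values entering the median are the overall effects quoted in this paragraph, one per review, with VanLehn's (2011) 10-study median of 0.79 taken as his summary value. The dictionary is our own arrangement of those figures.

```python
from statistics import median

# Overall effects (in standard deviations) quoted in the paragraph above.
human_tutoring_effects = {
    "Hartley (1977)": 0.60,
    "P. A. Cohen, Kulik, and Kulik (1982)": 0.40,
    "Mathes and Fuchs (1994)": 0.36,
    "G. W. Ritter et al. (2009)": 0.30,
    "VanLehn (2011)": 0.79,
}

median_effect = median(human_tutoring_effects.values())
print(f"Median effect: {median_effect:.2f} standard deviations")
```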

Reviewers have not yet reached a consensus on the size of ITS effects on student
learning. The most favorable conclusions come from early evaluations of
Cognitive  Tutor,  the  most  widely  used  of  all  ITSs.  Corbett,  Koedinger,  and 
Anderson  (1997),  for  example,  reported  an  average  improvement  in  test  scores  of 
1  standard  deviation  from  early  versions  of  Cognitive  Tutor.  They  calculated  this 
average  from  three  sources:  overall  results  reported  by  Anderson,  Boyle,  Corbett, 
and  Lewis  (1990)  and  Corbett  and  Anderson  (1991);  improvements  on  locally 
developed  tests  found  in  a  study  by  Koedinger,  Anderson,  Hadley,  and  Mark 
(1997); and improvements for an experienced user of intelligent tutoring programs
found in a study by Koedinger and Anderson (1993).

Four  recent  reviews  have  reported  moderate  effects  from  intelligent  tutoring 
(Ma,  Adesope,  Nesbit,  &  Liu,  2014;  Steenbergen-Hu  &  Cooper,  2014;  U.S. 
Department  of  Education,  Institute  of  Education  Sciences,  What  Works 
Clearinghouse,  2009;  VanLehn,  2011).  The  What  Works  Clearinghouse  review, 
the  earliest  of  the  four,  focused  on  the  use  of  Cognitive  Tutor  in  middle  school 
mathematics.  The  What  Works  evaluators  found  that  only  1  of  the  14  studies  that 
they  examined  met  their  criteria  for  an  acceptable  evaluation.  This  study  (S.  Ritter, 
Kulikowich,  Lei,  McGuire,  &  Morgan,  2007)  reported  that  Cognitive  Tutor 
improved  student  test  scores  by  0.38  standard  deviations.  What  Works  evaluators 
consider effects of 0.25 standard deviations and higher to be of substantive importance,
so they classified this effect as a potentially important one.

The  meta-analysis  by  VanLehn  (2011)  analyzed  results  from  54  comparisons 
of  learning  outcomes  for  ITSs  and  nontutored  groups.  The  54  comparisons  were 
found in 28 separate evaluation studies. The average ITS effect in the 54 comparisons
was an improvement in test scores of 0.58 standard deviations. VanLehn
classified the ITSs used in the 54 comparisons as either step based or substep
based.  Step-based  tutoring  provides  hints  and  explanations  on  steps  that  students 
normally  take  when  solving  problems.  Substep-based  tutoring,  which  is  a  newer 
and  more  exacting  approach,  provides  scaffolding  and  feedback  at  a  finer  level. 
Step-based tutoring, however, raised test scores by 0.76 standard deviations,
whereas substep-based tutoring raised test scores by only 0.40 standard deviations.
VanLehn’s findings suggest, paradoxically, that older and simpler ITSs have
strong effects on student performance, whereas newer and more sophisticated
ITSs appear to be no more effective than “nonintelligent” CAI tutors.

The  meta-analytic  review  by  Steenbergen-Hu  and  Cooper  (2014)  examined  35 
evaluations  of  ITS  effectiveness  in  colleges.  The  researchers  found  that  ITSs 
raised  test  scores  overall  by  approximately  0.35  standard  deviations,  but  they  also 
reported  that  type  of  control  group  strongly  influenced  evaluation  results.  ITS 
scores  were  0.86  standard  deviations  higher  than  control  scores  in  evaluations 
where  the  control  group  received  no  instruction,  0.37  standard  deviations  higher 
in  evaluations  where  the  control  group  received  conventional  instruction,  and 
0.25 standard deviations lower than control scores in evaluations where the control
group received human tutoring. Finally, the meta-analysis by Ma et al. (2014)
analyzed  107  findings  from  73  separate  reports.  The  average  ITS  effect  in  the  107 
comparisons  was  an  improvement  in  test  scores  of  0.43  standard  deviations.  In 
addition,  Ma  et  al.  reported  that  ITS  effects  varied  as  a  function  of  type  of  ITS 
used,  nature  of  the  control  group  in  a  study,  outcome  measure  employed,  and 
other  factors. 

Three  recent  reviews  reported  no  real  improvement  in  school  performance  due 
to  the  use  of  ITSs  (Slavin,  Lake,  &  Groff,  2009;  Steenbergen-Hu  &  Cooper,  2013; 
U.S.  Department  of  Education,  Institute  of  Education  Sciences,  What  Works 
Clearinghouse,  2009).  The  reviews  by  the  What  Works  Clearinghouse  and  Slavin 
et  al.  (2009)  focused  on  Cognitive  Tutor  evaluations.  The  What  Works  reviewers 
examined  27  evaluations  of  Cognitive  Tutor  Algebra  I  in  high  schools.  Only  three 
of the evaluations met all the criteria for their analysis; three others met the criteria
with reservations. Findings in the six evaluations were mixed, but the average
effect was very near zero, a decrease in test scores from the 50th to the 49th percentile.
Slavin et al. analyzed evaluations carried out in math courses in both
middle  and  high  schools.  They  located  13  evaluations,  but  only  7  of  these  met 
their  requirements  for  acceptable  studies.  Cognitive  Tutor  raised  student  test 
scores  by  an  average  of  0.12  standard  deviations  in  the  seven  evaluations.  Slavin 
et  al.  considered  this  effect  to  be  trivial.  It  was  less  than  their  cutoff  (0.20  standard 
deviations)  for  effects  of  substantive  importance. 

The  meta-analysis  by  Steenbergen-Hu  and  Cooper  (2013)  examined  26 
reports  on  K-12  mathematics  learning.  Based  on  34  comparisons  described  in 
the  reports,  Steenbergen-Hu  and  Cooper  concluded  that  ITSs  have  very  little  or 
no  overall  effect  on  learning  in  these  grades.  Test  scores  of  ITS  and  control 
students  differed  overall  by  around  0.05  standard  deviations,  a  trivial  amount. 
The  researchers  noted  that  ITS  effects  were  positive  and  somewhat  larger  in 
studies  that  were  less  than  1  year  in  duration.  Effects  were  decidedly  negative, 
however, in two studies designed specifically to help students who were classified
as lower achievers.

The lack of consensus about ITS effectiveness is striking. Questions loom up
on all sides. How effective are ITSs? Do they raise student performance a great
deal,  a  moderate  amount,  a  small  amount,  or  not  at  all?  If  ITSs  do  have  positive 
effects,  has  their  effectiveness  declined  with  the  fine-tuning  of  the  systems  in 
recent  years?  What  accounts  for  the  striking  differences  in  review  conclusions 
about  ITS  effectiveness?  This  review  uses  meta-analytic  methods  to  answer  these 
questions. 

Method 

Glass,  McGaw,  and  Smith  (1981)  identified  four  steps  in  a  meta-analysis: 
(a)  finding  studies,  (b)  coding  study  features,  (c)  measuring  study  effects,  and 
(d)  statistically  analyzing  and  combining  findings. 

Finding  Studies 

We used a two-stage procedure to find studies for this analysis. We first assembled
a large pool of candidate reports through computer searches of electronic
library databases. We then examined the candidate reports individually to determine
whether they contained relevant data for a meta-analysis.

Candidate  Reports 

To find candidate reports, we carried out computer searches of databases from four sources:
(a) the Educational Resources Information Clearinghouse (ERIC), (b) the National
Technical Information Service, (c) ProQuest Dissertations and Theses, and
(d) Google Scholar. We devised search strategies that took into account the characteristics
of each of the databases:

1. The ERIC search focused on documents tagged with the descriptor “intelligent
tutoring system” and one or more of the following descriptors:
“instructional effectiveness,” “comparative analyses,” and “computer software
evaluation.” The ERIC search yielded 104 reports.

2. The National Technical Information Service search focused on documents
labeled with the text string “intelligent tutoring systems” in the subject field.
This search yielded 120 documents.

3. The ProQuest Dissertations and Theses search targeted records containing
both the text string “intelligent tutoring” and some form of the word “evaluate”
in title, abstract, or keyword fields. The search yielded 98 dissertations.

4. The Google Scholar search focused on reports with the strings “intelligent
tutoring,” “evaluation,” “control group,” and “learning” in the full document text.
The search yielded 1,570 reports, which Google Scholar sorted by relevance
to the search terms. We found many useful reports in the first documents
listed by Google Scholar, but returns diminished quickly, and after
200 documents or so, Google Scholar stopped turning up useful new leads.
We therefore added only the first 250 reports to our list of candidate
reports.

We  found  additional  candidate  reports  by  branching  from  reference  lists  in 
reviews found in the four database searches. Two reviews were especially helpful:
VanLehn’s (2011) review, which examined results of 28 ITS evaluations, and a
Carnegie Learning (2011) reference list of 30 evaluations of its Cognitive Tutors.
Taking into account the overlap in documents located in these searches, we estimate
that our searches produced approximately 550 unique candidate reports for
our analysis.

Final  Data  Set 

After  reviewing  a  small  sample  of  candidate  reports,  we  developed  a  list  of 
requirements  that  evaluations  had  to  meet  to  be  considered  acceptable  for  this 
meta-analysis. The most important requirement was that the treatment group actually
received ITS instruction. CAI tutors continue to be developed, used, and
evaluated,  and  it  is  possible  to  confuse  these  CAI  systems  with  ITSs. 

Carbonell  (1970)  was  one  of  the  first  to  draw  a  clear  distinction  between  the 
two  tutoring  systems.  According  to  Carbonell,  computer  tutoring  systems  are 
either  frame  oriented  or  information  structure  oriented.  We  now  refer  to  Carbonell’s 
frame-oriented  tutors  as  CAI  tutors;  his  information  structure-oriented  tutors  are 
now  known  as  ITSs.  Frame-oriented  tutors  rely  on  frames,  or  prescripted  blocks 
of  material,  to  guide  instruction.  Information-structured  tutors  rely  on  organized 
knowledge databases, or information structures; computational and dialogue-generating
tools extract relevant information from these structures to carry on tutorial
interactions  with  learners.  Carbonell  thus  emphasized  two  key  defining  features 
of  intelligent  tutors:  (a)  an  information  structure,  or  knowledge  database,  and  (b) 
computational  and  dialogue-generating  tools  that  extract  relevant  information 
from  these  structures. 

Fletcher (1982, 1985) extended the definition of ITSs (then often called intelligent
computer-assisted instruction or ICAI) to include three key features: (a) an
explicit  domain-knowledge  model,  which  contains  the  foundations,  concepts,  and 
rules  that  experts  understand  and  use  in  solving  problems  in  the  domain;  (b)  a 
dynamic  student  model,  which  keeps  track  of  the  student’s  state  of  knowledge 
with  regard  to  the  domain;  and  (c)  a  pedagogical  module,  which  chooses  tutoring 
strategies  and  actions  to  apply  in  specific  situations  for  specific  students.  Anderson 
and  his  colleagues  (Anderson  et  al.,  1990;  Anderson  &  Reiser,  1985)  added  a 
fourth  defining  feature:  a  user  interface  that  students  use  to  communicate  flexibly 
with  the  system.  For  many  years,  these  four  structural  characteristics  were 
accepted  as  the  defining  features  of  ITS  instruction. 

VanLehn  (2006)  has  noted  that  ITSs  today  come  in  different  shapes,  sizes,  and 
designs, but whatever their structures, they all share common behavioral characteristics.
To distinguish between CAI tutors and ITSs, VanLehn first described
two types of tutoring behaviors: (a) outer-loop behaviors, which give learners
end-of-problem support, including appropriate feedback on their problem solutions
and appropriate new problems to solve; and (b) inner-loop behaviors, which
include  prompting,  hinting,  and  other  support  given  while  a  student  is  working  on 
a  problem.  In  VanLehn’s  view,  ITSs  display  both  inner-  and  outer-loop  behaviors, 
whereas  CAI  tutors  display  outer-loop  behaviors  only. 

Although  experts  may  differ  on  how  to  define  intelligent  tutoring,  they  usually 
agree  on  whether  specific  tutoring  systems  are  intelligent  or  not.  We  therefore 
took a practical approach to the matter of identifying ITSs. We examined three
factors before making final decisions. First, did the evaluator classify the computer
tutor as an ITS? Second, do experts in the field also classify it as an ITS?
Finally,  does  the  computer  tutor,  like  a  human  tutor,  help  learners  while  they  are 
working  on  a  problem  and  not  just  after  they  have  recorded  their  solutions? 

In  addition  to  focusing  on  intelligent  tutoring,  evaluations  had  to  meet  seven 
other  requirements: 

1. Evaluations included in the meta-analysis could be either field evaluations
or laboratory investigations, but all evaluations had to use an experimental
or quasi-experimental design. Most of the 550 candidate reports found in
the computer searches failed to meet this basic requirement. The pool of
candidate reports included planning documents, reports on software
development, impressionistic evaluations, case studies, review documents,
and single-group studies. None of these provided results that could
be used in our analysis.

2.  Control  groups  had  to  receive  conventional  instruction.  A  control  group 
could  be  either  a  conventional  class  or  a  specially  constituted  group  that 
received  instruction  that  closely  approximated  conventional  teaching. 
Unacceptable  for  our  analysis  were  evaluations  in  which  control  groups 
used  materials  that  were  extracted  from  ITS  computer  interactions,  for 
example, canned text groups and vicarious-learning groups that studied
scripts derived from ITS transcripts (e.g., Craig, Sullins, Witherspoon, &
Gholson, 2006; Graesser et al., 2004; VanLehn et al., 2007). Also unacceptable
were studies in which control groups were taught by human
tutors  or  CAI  tutors  or  received  no  relevant  instruction. 

3.  Achievement  outcomes  had  to  be  measured  quantitatively  and  in  the  same 
way in both treatment and control groups. Results on both locally developed
posttests and standardized tests were acceptable. Standardized tests
included  district,  state,  and  national  assessments,  as  well  as  published 
tests.  School  grades  were  not  an  acceptable  outcome  measure,  because 
grades are often awarded on a different basis by different teachers in treatment
and control classes. Also unacceptable for this meta-analysis were
process  measurements  made  during  the  course  of  a  treatment. 

4. The treatment had to cover at least one problem set or homework assignment,
and the treatment duration had to be at least 30 minutes. Field
evaluations  were  usually  much  longer  in  duration  and  easily  met  this 
requirement.  Laboratory  investigations  usually  covered  only  a  small 
number  of  assignments  or  problem  sets  and  were  usually  short  in 
duration. 

5. The treatment had to be implemented without major failures in the computer
system or program administration. Excluded from this meta-analysis
were  results  from  implementations  that  were  substantively  disrupted  by 
software  or  hardware  failures. 

6. Treatment and control groups had to be similar at the start of the evaluation.
We eliminated from our data set any evaluation in which treatment
and control groups differed by 0.5 standard deviations or more on pretests.
Differences of this magnitude are too large to be adjusted by such
techniques as gain score or covariance analysis. Also eliminated were
evaluations  in  which  experimental  and  control  groups  were  drawn  from 
different populations (e.g., volunteers in the treatment group and nonvolunteers
in the control group).

7.  Overalignment  of  a  study’s  outcome  measure  with  treatment  or  control 
instruction  was  also  a  cause  for  excluding  an  evaluation  from  our  analysis. 
Overalignment  occurred,  for  example,  when  the  outcome  measure  used 
test  items  that  were  included  in  the  instructional  materials  for  either  the 
treatment  or  control  group. 
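
Two of the requirements listed above are numeric and can be expressed as a simple screen: the 30-minute minimum duration (requirement 4) and the 0.5 standard deviation limit on pretest differences (requirement 6). The sketch below is only an illustration of those two thresholds. The `Evaluation` record and its field names are hypothetical and do not come from the review, and the remaining requirements call for judgment that code cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    # Hypothetical record; field names are illustrative, not from the review.
    pretest_mean_treatment: float
    pretest_mean_control: float
    pretest_sd: float
    duration_minutes: float


def standardized_pretest_gap(ev: Evaluation) -> float:
    """Absolute pretest difference between groups, in standard deviation units."""
    return abs(ev.pretest_mean_treatment - ev.pretest_mean_control) / ev.pretest_sd


def passes_numeric_screen(ev: Evaluation) -> bool:
    """Apply only the duration and pretest-similarity thresholds."""
    long_enough = ev.duration_minutes >= 30          # at least 30 minutes
    similar = standardized_pretest_gap(ev) < 0.5     # 0.5 SD or more is excluded
    return long_enough and similar


example = Evaluation(52.0, 50.0, 10.0, 45.0)
print(passes_numeric_screen(example))
```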

Only  50  of  the  550  candidate  reports  described  evaluations  that  met  all  of  the 
above  requirements  and  were  thus  qualified  for  use  in  the  meta-analysis.  Along 
with  results  from  acceptable  comparisons,  a  few  of  the  50  reports  included  results 
from  unacceptable  comparisons,  for  example,  from  comparisons  with  poorly 
implemented  ITSs  or  inadequate  control  groups.  Only  results  from  the  adequate 
comparisons  were  included  in  the  meta-analysis. 

Describing  Evaluation  Features 

We used 15 variables to describe features of the evaluations (Table 1). Our
selection of the 15 variables and coding categories was guided by our preliminary
examination of the evaluations along with our examination of other reviews on
intelligent tutoring and CAI tutoring. We originally coded some observations as
continuous measurements (e.g., study year, sample size, and study length), but we
later recoded the observations into ordered categories. The categorization helped
solve analytic problems presented by skew, nonnormality, and presence of outliers
in the continuous measurements.
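
The recoding of continuous measurements into ordered categories can be sketched as follows, using the cut points given in Table 1 for sample size, study duration, and publication year. The helper function and constant names are our own, and the sketch assumes whole-number inputs (participants, weeks, years).

```python
from bisect import bisect_left


def ordered_category(value: int, upper_bounds: list[int]) -> int:
    """Return a 1-based category code.

    `upper_bounds` holds the inclusive upper limit of every category except
    the last, in ascending order.
    """
    return bisect_left(upper_bounds, value) + 1


# Cut points taken from Table 1.
SAMPLE_SIZE_BOUNDS = [80, 250]          # 1 = up to 80, 2 = 81-250, 3 = 251+
DURATION_WEEKS_BOUNDS = [4, 16]         # 1 = up to 4, 2 = 5-16, 3 = 17+
PUBLICATION_YEAR_BOUNDS = [2000, 2005]  # 1 = up to 2000, 2 = 2001-2005, 3 = 2006 onward

print(ordered_category(120, SAMPLE_SIZE_BOUNDS))
```

Coding a long-tailed variable such as sample size this way keeps one very large study from dominating an analysis, which is the problem with skew and outliers that the text describes.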

Calculating  Size  of  Effects 

The  experimental  effect  size  is  defined  as  the  difference  in  posttest  means  for 
experimental  and  control  populations,  divided  by  the  within-group  population 
standard  deviation  (Glass  et  al.,  1981).  Meta-analysts  estimate  the  population 
means  and  standard  deviations  from  sample  statistics  included  in  research  reports. 
We set up specific guidelines to help us choose the most appropriate sample statistics
for estimating these population values.
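
The definition above can be made concrete with a short sketch. It assumes that the within-group standard deviation is estimated by pooling the sample standard deviations of the two groups, which is one common choice; the review's own estimates may have been computed differently for particular studies. Function names and the example numbers are hypothetical.

```python
import math


def pooled_sd(sd_treatment: float, n_treatment: int,
              sd_control: float, n_control: int) -> float:
    """Pooled within-group standard deviation of two independent samples."""
    numerator = (n_treatment - 1) * sd_treatment ** 2 + (n_control - 1) * sd_control ** 2
    return math.sqrt(numerator / (n_treatment + n_control - 2))


def effect_size(mean_treatment: float, mean_control: float, sd_within: float) -> float:
    """Difference in posttest means divided by the within-group standard deviation."""
    return (mean_treatment - mean_control) / sd_within


# Hypothetical posttest summary statistics for one evaluation.
sd = pooled_sd(sd_treatment=10.0, n_treatment=30, sd_control=10.0, n_control=30)
print(effect_size(mean_treatment=78.0, mean_control=72.0, sd_within=sd))
```

Substituting the control group's standard deviation for the pooled value in `effect_size` gives the control-group-scaled variant of the same measure.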

Mean  Differences 

Whenever  possible,  we  estimated  mean  differences  from  posttest  means  that 
were  adjusted  for  pretreatment  differences  either  by  covariance  or  regression 
analysis. When studies did not report results from covariance or regression analysis,
we estimated mean differences from pre-post gain scores of treatment and
control  groups.  For  studies  that  provided  neither  adjusted  means  nor  gain  score 
means,  we  estimated  mean  differences  from  raw  posttest  means.  We  set  up  these 
guidelines  to  maximize  the  precision  of  our  estimates  of  treatment  effects. 
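
The order of preference described above (adjusted means first, then gain scores, then raw posttest means) amounts to a fallback chain. The following sketch is illustrative only; the function and its argument names are hypothetical, and each argument is a (treatment mean, control mean) pair or None when a report does not supply it.

```python
from typing import Optional, Tuple

MeanPair = Tuple[float, float]  # (treatment mean, control mean)


def mean_difference(adjusted: Optional[MeanPair] = None,
                    gain: Optional[MeanPair] = None,
                    raw: Optional[MeanPair] = None) -> Tuple[str, float]:
    """Return the source used and the treatment-minus-control difference,
    preferring adjusted posttest means, then gain scores, then raw posttests."""
    for source, pair in (("adjusted", adjusted), ("gain", gain), ("raw", raw)):
        if pair is not None:
            treatment_mean, control_mean = pair
            return source, treatment_mean - control_mean
    raise ValueError("report provides no usable means")


# A report with gain scores and raw posttests but no adjusted means.
print(mean_difference(gain=(12.0, 8.0), raw=(80.0, 75.0)))
```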

Standard  Deviations 

We  used  raw  standard  deviations,  rather  than  adjusted  ones,  in  calculating  size 
of  effect.  Adjusted  standard  deviations  include  gain  score  and  covariate-adjusted 


49 


Downloaded  from  http://rer.aera.net  at  UCLA  on  March  8,  2016 


TABLE  1 

Fifteen  study  features  and  associated  coding  categories 


Country  (1  =  United  States,  2  =  other) 

Publication  year  (1  =  up  to  2000,  2  =  2001-2005,  3  =  2006  onward) 

Grade  level  (1  =  K-12,  2  =  postsecondary) 

Subject  (1  =  math,  2  =  other) 

Study  type 

1  =  Experimental:  short-term  studies  in  which  treatment  and  control  groups  work  on 
the  same  assignments  with  or  without  intelligent  tutoring 

2  =  Field  evaluations:  studies  that  compare  performance  in  conventionally  taught  and 
intelligent  tutoring  classes 

Sample  size  ( 1  =  up  to  80,  2  =  8 1-250,  3  =  25 1+) 

Study  duration  (1  =  up  to  4  weeks,  2  =  5-16  weeks,  3  =  17+  weeks) 

Intelligent  tutoring  system  type  ( 1  =  step  based,  2  =  substep  based) 

Cognitive  Tutor  study 

1  =  No:  not  an  evaluation  of  a  current  or  earlier  version  of  a  Carnegie  Learning  Cogni¬ 
tive  Tutor  program 

2  =  Yes:  evaluation  of  such  software 
Group  assigmnent 

1  =  Intact  groups:  existing  classes  or  groups  assigned  to  treatment  and  control  condi¬ 
tions 

2  =  Random:  participants  assigned  randomly  to  conditions 
Instructor  effects 

1  =  Different  instructors:  different  teachers  taught  treatment  and  control  groups 

2  =  Same  instructor:  same  teacher  or  teachers  taught  treatment  and  comparison  groups 
Pretreatment  differences 

1  =  Unadjusted  posttest:  posttest  means  not  adjusted  for  pretest  differences 

2  =  Adjusted  posttest:  gain  scores  or  posttest  means  adjusted  by  covariance  or  regres¬ 
sion 

Publication  bias 

1  =  Published:  study  reported  in  a  journal  article,  published  proceedings,  or  book 

2  =  Unpublished:  study  reported  in  a  dissertation  or  in  a  technical  report 
Test  type 

1  =  Local:  posttest  was  locally  developed 

2  =  Standardized:  posttest  was  a  commercial,  state,  or  district  test 
Test  format 

1  =  Constructed-response  items  only:  posttest  was  a  problem-solving  test,  essay  exam, 
etc. 

2  =  Both  constructed-response  and  objective-test  items:  posttest  included  both 
constructed-response  and  objective-test  items 

3  =  Objective  items:  posttest  was  a  multiple-choice  test  or  other  test  with  a  fixed  alter¬ 
native  format 

standard deviations as well as standard deviations derived from within-group
variances in multifactor experimental designs. Experts usually caution against
using such standard deviations in calculating size of effect (e.g., Borenstein,
2009; Glass et al., 1981). For reports that included only adjusted standard
deviations, we estimated raw standard deviations using standard formulas and
assuming a correlation of .60 between pretests and posttests. This is the median
correlation in five studies in our data set that either reported pre-post
correlations or presented data from which such correlations could be derived
(Arnott, Hastings, & Allbritton, 2008; Fletcher, 2011; Pek & Poh, 2005;
Suraweera & Mitrovic, 2002; VanLehn et al., 2007).
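
The conversion from an adjusted to a raw standard deviation can be sketched as follows. This is an illustration only: it assumes the adjusted value is a covariance-adjusted (residual) standard deviation, for which a commonly used relation is SD_adjusted = SD_raw * sqrt(1 - r^2). The review does not say which formulas were applied, and the function name and input value are hypothetical.

```python
import math


def raw_sd_from_adjusted(adjusted_sd: float, pre_post_r: float = 0.60) -> float:
    """Estimate a raw posttest SD from a covariance-adjusted SD.

    Assumes SD_adjusted = SD_raw * sqrt(1 - r**2), where r is the
    pretest-posttest correlation (.60 is the value assumed in the review).
    """
    if not -1.0 < pre_post_r < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    return adjusted_sd / math.sqrt(1.0 - pre_post_r ** 2)


# Hypothetical adjusted SD of 8.0 test-score points.
print(raw_sd_from_adjusted(8.0))
```

With r = .60 the divisor is sqrt(1 - .36) = .80, so under this assumption the raw standard deviation is 25% larger than the adjusted one.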

Glass’s ES and Hedges’s g

Tamim et al. (2011) reported that Hedges’s g and Glass’s ES were the two estimators
of size of effect most often used in 25 meta-analyses on instructional technology
conducted during the past four decades. Ten of the 25 meta-analyses, covering a
total of 239 studies, used Hedges’s g exclusively to report size of effects,
whereas 6 meta-analyses, covering 505 studies, used Glass’s ES exclusively. The
remaining meta-analyses used either a different estimator of size of effect
(e.g., a correlation coefficient), an unspecified estimator, or a combination of
estimators.

Glass’s ES and Hedges’s g measure treatment effects in different ways. Glass’s
ES measures effects in control group standard deviations (Glass et al., 1981);
Hedges’s g uses pooled treatment and control standard deviations (Hedges &
Olkin, 1985). We calculated both Hedges’s g and Glass’s ES, whenever possible,
for studies in our data set and found a very high correlation between the two
estimators (.97). The average Hedges’s g, however, was 0.05 standard deviations
lower than the average Glass’s ES in the 36 studies for which we could make both
estimates; median g was 0.08 standard deviations lower than the median ES. In
addition, the values of the two estimators diverged more substantially in those
cases where treatment and control standard deviations were significantly different
(e.g., Fletcher, 2011; Fletcher & Morrison, 2012; Gott, Lesgold, & Kane, 1996;
Hastings, Arnott-Hill, & Allbritton, 2010; Le, Menzel, & Pinkwart, 2009; Naser,
2009; Reif & Scott, 1999).
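
The two estimators can be sketched in a few lines of code. The sketch assumes the usual textbook definitions: Glass’s ES divides the mean difference by the control group standard deviation, and Hedges’s g divides it by the pooled standard deviation and applies the approximate small-sample correction 1 - 3/(4df - 1). The function names and the input values are hypothetical and are not taken from any study in the review.

```python
import math


def glass_es(mean_t: float, mean_c: float, sd_c: float) -> float:
    """Mean difference expressed in control-group standard deviations."""
    return (mean_t - mean_c) / sd_c


def hedges_g(mean_t: float, mean_c: float, sd_t: float, sd_c: float,
             n_t: int, n_c: int) -> float:
    """Mean difference in pooled SD units, with small-sample correction."""
    df = n_t + n_c - 2
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df
    d = (mean_t - mean_c) / math.sqrt(pooled_var)
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)  # approximate bias correction
    return correction * d


# Hypothetical study in which the treatment group is more variable
# than the control group (SD 15 versus SD 10).
es = glass_es(mean_t=75.0, mean_c=70.0, sd_c=10.0)
g = hedges_g(mean_t=75.0, mean_c=70.0, sd_t=15.0, sd_c=10.0, n_t=30, n_c=30)
print(es, g)
```

In this made-up case the pooled standard deviation is larger than the control standard deviation, so g comes out smaller than ES; the gap shrinks as the two standard deviations approach each other.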

Using pooled standard deviations makes a great deal of sense when treatment
and control standard deviations can be assumed to be equal. Pooling standard
deviations is less justifiable when the two standard deviations are significantly
different. For example, pooling is probably the wrong choice when a highly
effective treatment brings all or almost all members of a heterogeneous
population up to a uniformly high level of posttest performance. Such highly
effective treatments can reduce standard deviations significantly below normal
levels. Pooling standard deviations is probably also the wrong choice when a
treatment affects different students very differently, for example, by greatly
improving the performance of some while hampering the performance of others.
Such treatments can raise standard deviations above normal levels.

We report results with both Glass’s ES and Hedges’s g, but we give primary
emphasis to Glass’s ES and treat Hedges’s g as an important supplementary
measure. Our preference for Glass’s ES is based primarily on our reluctance to
make a blanket assumption that control and treatment variances are equal in the
studies, the assumption that is usually made when standard deviations are
pooled. We found too many instances of unequal treatment and control variances
for us to be comfortable with an assumption of no treatment effect on variance.


Statistical  Analysis 

A fundamental choice in any meta-analysis is whether to use weighted or
unweighted means when combining estimators of size of effect. Glass et al. (1981)
recommend using unweighted means. Hedges and Olkin (1985) recommend using
weighted ones. The weights that Hedges and Olkin assign are different for
fixed-effect and random-effects analyses. In fixed-effect analyses, where all
studies can be assumed to share a common population effect, they weight the
observed estimators of size of effect by the inverse of their sampling variances
(the squared standard errors), which is roughly equivalent to weighting by sample
size. In random-effects analyses, where an assumption of a common underlying
population effect is untenable, they use a more complex weighting system.

The high correlation between sample size and other important variables in our
data set makes us cautious about weighting means fully or in part by sample size.
For example, almost all of the large studies in our data set evaluated a single
software program, Cognitive Tutor, and measured learning gains on off-the-shelf
standardized tests rather than local tests tailored to local curricula. In
addition, the large studies in our data set were longer in duration and probably
lower in implementation quality than small studies. If we used weighted means
exclusively in our analyses, our conclusions would be very heavily influenced by
a few large-scale evaluations of Cognitive Tutor. For example, if we assigned
weights for a fixed-effect analysis, the largest evaluation in our data set (with
9,840 students) would receive nearly 750 times the weight of the smallest (with
24 students). With the weights assigned in a random-effects analysis, the largest
evaluation receives about 5 times the weight of the smallest. Without weighting,
each evaluation study receives the same weight.
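
The way weighting schemes change the relative influence of large and small studies can be illustrated with a short sketch. It assumes inverse-variance weights, a common large-sample approximation for the variance of a standardized mean difference, equal-sized treatment and control groups, an effect of 0.20 in both studies, and a between-study variance (`TAU2`) of 0.04. All of these inputs are hypothetical, so the sketch is not expected to reproduce the ratios reported above, which depend on the actual group sizes and effect estimates.

```python
def d_variance(d: float, n_t: int, n_c: int) -> float:
    """Approximate sampling variance of a standardized mean difference."""
    n = n_t + n_c
    return n / (n_t * n_c) + d ** 2 / (2 * n)


def fixed_effect_weight(variance: float) -> float:
    """Inverse-variance weight used in a fixed-effect analysis."""
    return 1.0 / variance


def random_effects_weight(variance: float, tau2: float) -> float:
    """Inverse of sampling variance plus between-study variance."""
    return 1.0 / (variance + tau2)


TAU2 = 0.04  # hypothetical between-study variance

# Hypothetical equal splits of the largest (9,840) and smallest (24) samples.
v_large = d_variance(d=0.20, n_t=4920, n_c=4920)
v_small = d_variance(d=0.20, n_t=12, n_c=12)

ratio_fixed = fixed_effect_weight(v_large) / fixed_effect_weight(v_small)
ratio_random = (random_effects_weight(v_large, TAU2)
                / random_effects_weight(v_small, TAU2))
print(ratio_fixed, ratio_random)
```

Under these assumptions the fixed-effect ratio is simply the ratio of the sample sizes, whereas adding the between-study variance to every study's variance compresses the ratio to a single-digit number, which is the qualitative contrast the paragraph describes.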

We calculated unweighted means for our primary analyses, but we also calculated
weighted means for our supplementary analyses. We calculated the weighted
means using the procedures that Hedges and his colleagues developed for
random-effects analyses (Comprehensive Meta-Analysis, Version 2.2.064). We do
not include any results from fixed-effect analyses, because a fixed-effect model
does not fit our data set. A fixed-effect model would not accurately represent the
uniqueness and diversity of the individual treatments and measures used in the
evaluations. In our experience, fixed-effect models are seldom if ever appropriate
for meta-analytic data sets in education and the social sciences.

Another decision in meta-analysis involves the treatment of evaluation reports
with multiple findings. Some meta-analysts report a single value for size of effect
for each evaluation study; some report as many values as there are independent
groups in the study. We used both approaches. We carried out our primary analyses
with each study represented by a single value for size of effect, and we carried
out supplemental analyses with each evaluation report represented by as many
independent groups as were included in the evaluation.

The 50 reports located for this meta-analysis are a diverse group (Table 2).
They describe evaluations that were carried out on four continents over the course
of nearly three decades. The content taught ranged from “borrowing” in
third-grade subtraction to solving analytic problems from the Law School
Admissions Test. The evaluations took place in elementary schools, high schools,
colleges, and military training institutions. The shortest of the evaluations
provided less than 1 hour of intelligent tutoring; the longest provided
intelligent tutoring for three semesters, or 48 weeks.


Overall  Effects 

For our primary analysis, we used Glass’s ES as the estimator of size of effect,
evaluation study as the unit of analysis, and unweighted means to represent
combined effects. Supplementary analyses used Hedges’s g as the estimator of size
of effect, both evaluation study and evaluation finding as units of analysis, and
both weighted means and unweighted means to represent overall effects.

Primary Analysis

Students who received intelligent tutoring outperformed control students on
posttests in 46 (or 92%) of the 50 studies. In 39 (or 78%) of the 50 studies,
tutoring gains were larger than 0.25 standard deviations, or large enough to be
considered of substantive importance by the standards of the What Works
Clearinghouse (U.S. Department of Education, Institute of Education Sciences,
What Works Clearinghouse, 2013). Thus, the vast majority of studies found ITS
effects that were not only positive but also large enough to be important for
instruction.

The  strongest  effects  in  the  50  evaluations  were  produced  by  the  DARPA 
Digital  Tutor,  an  ITS  developed  to  teach  U.S.  Navy  personnel  the  knowledge  and 
skills  needed  by  information  systems  technicians  in  duty  station  settings.  The 
DARPA  Digital  Tutor  was  evaluated  in  two  separate  summative  evaluations 
(Fletcher,  2011;  Fletcher  &  Morrison,  2012).  Each  of  the  evaluations  compared 
end-of-course  test  scores  from  a  Digital  Tutor  course  with  scores  from  a  standard 
classroom  course.  In  the  first  evaluation,  the  Digital  Tutor  course  lasted  8  weeks, 
and  the  classroom  course,  17  weeks.  In  the  second  evaluation,  the  Digital  Tutor 
course lasted 16 weeks, and the classroom course, 35 weeks. Both of the evaluations
measured outcomes on locally developed, third-party tests: a 4-hour written
test  and  a  half-hour  oral  examination  given  by  a  review  board.  The  first  evaluation 
also  included  two  tests  of  individual  problem  solving;  the  second  evaluation 
included  measurement  of  troubleshooting  skills  of  three-member  teams  that 
responded  to  actual  requests  for  shore-based  assistance.  Average  ES  in  the  first 
evaluation  was  1.97  (Fletcher,  2011);  average  ES  in  the  second  evaluation  was 
3.18  (Fletcher  &  Morrison,  2012). 

Both ESs are outliers, the only ones in the data set, where an outlier is defined
as a high value that is at least 1.5 interquartile ranges above the 75th percentile or
a low value that is at least 1.5 interquartile ranges below the 25th percentile. To
keep these extreme values from having an undue influence on results, we formed
a 90% Winsorized data set by substituting the value at the 95th percentile for these

TABLE 2

Descriptive information and ESs for 50 ITS evaluations

Carlson and Miller (1996) | Writing, 2 high schools, San Antonio, TX, 1993 | 852 students (429 T, 423 C) | Fundamental Skills Training Project’s R-WISE 1.0 | SBT | 1 term, 9 sessions, 8 hours total | Local | 0.78

Graesser, Moreno, et al. (2003); reanalyzed in Graesser et al. (2004) | Lesson in computer literacy at University of Memphis | 81 students | AutoTutor 1.0 and 2.0 | SSBT | 1 session, 45-55 minutes | Local | 0.17

Grubisic, Stankov, Rosic, and Zitko (2009) | Introduction to computer science, University of Split, Croatia, | 39 students (20 T, 19 C) | xTex-Sys (extended Tutor-Expert | SBT | 14 weeks | Local | 1.23

Hadley, and Mark (1997) | Pittsburgh, PA, Grade 9, 1993-1994 | (470 T, 120 C) | Algebra Tutor) and Pittsburgh Urban Math Project | standardized

Le, Menzel, and Pinkwart | Computer programming, University | 35 students | INCOM | SBT | 1 session, 1 hour | Local | 0.31

(2001); also Graesser et al. (2004)

Phillips and Johnson (2011) | Financial Accounting, University of Saskatchewan | 139 students | ITS | SBT | 1 homework assignment | Local | 0.39

Skills)

Suraweera and Mitrovic (2002) | Database design, University of Canterbury, Christchurch, New Zealand, August 2001 | 62 students | KERMIT | SBT | 1 session, about 1 hour | Local | 0.56
Kulik & Fletcher

two  outlier  values  and  also  substituting  the  value  at  the  5th  percentile  for  the  two 
lowest  observed  values.  We  report  averages  for  both  the  original  data  set  and  the 
90%  Winsorized  data  set  below. 
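
The outlier rule and the Winsorizing step can be sketched as follows. The sketch makes two assumptions that the review does not spell out: percentiles are computed by linear interpolation between order statistics, and Winsorizing is implemented as clipping every value to the 5th and 95th percentiles. Because percentile conventions differ, the number of values replaced may not match the two-per-tail substitution described above. The data are hypothetical.

```python
def percentile(values, p):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def iqr_outliers(values):
    """Values at least 1.5 IQRs above the 75th or below the 25th percentile."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    high_fence, low_fence = q3 + 1.5 * iqr, q1 - 1.5 * iqr
    return [v for v in values if v >= high_fence or v <= low_fence]


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given percentiles (a 90% Winsorization by default)."""
    low, high = percentile(values, lower_pct), percentile(values, upper_pct)
    return [min(max(v, low), high) for v in values]


# Hypothetical effect sizes with one extreme value.
effects = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 3.0]
print(iqr_outliers(effects))
print(winsorize(effects))
```

Clipping keeps every study in the data set but limits how far an extreme value can pull the mean and standard deviation.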

The median ES in the original data set is 0.66. The mean ES is 0.65; the
standard deviation is 0.56. In the Winsorized data set, median is 0.66, mean is
0.61, and standard deviation is 0.38. An improvement in test scores of 0.66
standard deviations over conventional levels is equivalent to an improvement from
the 50th to the 75th percentile. According to J. Cohen (1988), an effect of 0.20
standard deviations is small, 0.50 standard deviations is medium size, and 0.80
standard deviations is large. By these standards, the average ES for intelligent
tutoring is moderate to large.
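
The percentile interpretation follows from the standard normal distribution. As a minimal sketch, assuming test scores are normally distributed and that the ES is expressed in control group standard deviations, the percentile rank of the average tutored student within the control distribution is the normal cumulative probability at the ES. The function name is hypothetical.

```python
import math


def percentile_from_es(es: float) -> float:
    """Percentile rank, in the control distribution, of the average
    treated student, assuming normally distributed scores."""
    return 100.0 * 0.5 * (1.0 + math.erf(es / math.sqrt(2.0)))


print(percentile_from_es(0.0))   # no effect: average student stays at the median
print(percentile_from_es(0.66))  # the median ES reported in the review
```

For an ES of 0.66 the value is between 74 and 75, which corresponds to the move from the 50th to the 75th percentile described above.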

Supplementary  Analyses 

We calculated the same statistics for the 63 independent comparisons included
in the 50 studies. Results were affected very little by this change in unit of
analysis. For example, the median ES is 0.63 in the data set of 63 independent
comparisons. Mean ES is 0.62 without Winsorization and 0.59 with Winsorization.
In 58 (or 92%) of the 63 comparisons, the ITS group scored higher than the control
group; and in 49 (or 78%) of the comparisons, the improvement due to ITS use
was substantively important, or more than 0.25 standard deviations.

Results were only slightly different when we calculated the same statistics for
Hedges’s g without weighting means. With evaluation study as the unit of analysis,
the median g is 0.64 for the 50 cases. The mean g is 0.62 without Winsorization and
0.60 with Winsorization. With independent comparison as the unit of analysis, the
median g is 0.61 for the 63 comparisons. The mean g is 0.59 without Winsorization
and 0.57 with Winsorization. Results changed, however, when we used weighted
means in the analysis. With evaluation report as the unit of analysis and weighting
based on a random-effects model, the average g = 0.50, 95% CI [0.40, 0.59], p <
.001. With evaluation finding as the unit of analysis and weighting based on a
random-effects model, the average g = 0.49, 95% CI [0.40, 0.58], p < .001.
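
A random-effects weighted mean with a 95% confidence interval can be sketched as follows. The review obtained its estimates from the Comprehensive Meta-Analysis program; the sketch below instead assumes the DerSimonian-Laird method-of-moments estimator of between-study variance, which is one common choice and may differ in detail from what that program computed. The function name and the inputs are hypothetical.

```python
import math


def random_effects_mean(effects, variances):
    """Return (mean, (ci_low, ci_high), tau2) for a random-effects model,
    using the DerSimonian-Laird estimate of between-study variance."""
    if len(effects) < 2 or len(effects) != len(variances):
        raise ValueError("need at least two studies with matching variances")
    w = [1.0 / v for v in variances]
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Q statistic: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights add tau2 to every study's sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return mean, (mean - 1.96 * se, mean + 1.96 * se), tau2


# Hypothetical effect sizes and sampling variances for four studies.
mean_g, ci, tau2 = random_effects_mean([0.10, 0.45, 0.70, 0.95],
                                       [0.002, 0.05, 0.08, 0.12])
print(mean_g, ci, tau2)
```

Because the large, precise study in this made-up example has the smallest effect, the weighted mean falls below the unweighted mean of the four values, the same direction of change the paragraph reports.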

Evaluation  Features  and  Effects  Overall 

Although ITSs most often improved learning by moderately large amounts,
their effects were very large in some studies and near zero in others. To determine
whether study features were related to the variation in results, we carried out a
series of univariate analyses of variance (ANOVAs) with study feature as
independent variable and size of effect as dependent variable.

Primary Analysis

The dependent variable in the primary analysis was Glass’s ES, evaluation
study was the unit of analysis, and the Winsorized data set was used to keep
outliers from having an inordinate influence on the analysis. Results show that
test type is the study feature most strongly related to ES (Table 3). ESs are large
in evaluations that used local tests as outcome measures (average ES = 0.73), small
in evaluations that used standardized tests (average ES = 0.13), and intermediate in
evaluations that used a combination of the two (average ES = 0.45). Five additional
study features are also strongly related to ES: sample size, grade level of



Downloaded  from  http://rer.aera.net  at  UCLA  on  March  8,  2016 


TABLE 3

Relationship between study features and study effects

                                                                            Category ES
Study feature             r with ES  Categories                        N    M     SD
Test type                 -.63***    1 = Local                         38   0.73  0.32
                                     2 = Local and standardized        3    0.45  0.24
                                     3 = Standardized                  9    0.13  0.17
Sample size                          1 = Up to 80 participants         26   0.78  0.34
                                     2 = 81 through 250 participants   10   0.53  0.31
                                     3 = More than 250 participants    13   0.30  0.30
Grade level               .41**      1 = Elementary and high school    23   0.44  0.33
                                     2 = Postsecondary                 27   0.75  0.36
Subject                   .41**      1 = Mathematics                   18   0.40  0.34
                                     2 = Other                         32   0.72  0.35
Test item format          -.33*      1 = Constructed response only     15   0.84  0.26
                                     2 = Constructed and objective     14   0.47  0.36
                                     3 = Objective only                17   0.53  0.44
Cognitive Tutor study     -.28*      1 = No                            35   0.68  0.34
                                     2 = Yes                           15   0.45  0.42
Country                   .26        1 = United States                 39   0.56  0.38
                                     2 = Other                         11   0.79  0.31
Publication bias          -.25       1 = Published                     35   0.67  0.35
                                     2 = Unpublished                   15   0.46  0.42
Publication year          -.21       1 = Up to 2000                    12   0.78  0.21
                                     2 = 2001 through 2005             14   0.55  0.40
                                     3 = After 2006                    24   0.56  0.41
Pretreatment differences  -.17       1 = Unadjusted posttest           14   0.72  0.39
                                     2 = Adjusted posttest             35   0.58  0.37
Group assignment          -.15       1 = Intact groups                 32   0.67  0.38
                                     2 = Random assignment             13   0.55  0.38
Study duration            -.13*      1 = Up to 7 weeks                 22   0.64  0.33
                                     2 = 8 weeks or more               25   0.54  0.42
Instructor effects        .10        1 = Different instructors         16   0.55  0.43
                                     2 = Same instructor               30   0.63  0.37
Tutoring steps            .02        1 = Step based                    41   0.60  0.39
                                     2 = Substep based                 9    0.63  0.31
Study type                .01        1 = Experimental study            15   0.60  0.33
                                     2 = Field evaluation              35   0.61  0.40

Note. ES = Glass’s estimator of effect size.
*p < .05. **p < .01. ***p < .001.


participants, subject taught, test item format, and the tutoring system used in the
evaluation. Specifically, study effects are smaller when (a) outcomes are measured
on standardized rather than local tests, (b) sample size is large, (c) participants are
at lower grade levels, (d) the subject taught is math, (e) a multiple-choice test is
used to measure outcomes, and (f) Cognitive Tutor is the ITS used in the
evaluation.
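
The kind of one-way ANOVA described here can be approximated from the summary statistics in Table 3. The sketch below uses the N, M, and SD values for the three test-type categories. Because those values are rounded and the original analysis was run on study-level data, the result is only a rough check and not the authors' reported statistic. The function name is hypothetical.

```python
def anova_from_summary(groups):
    """One-way ANOVA from (n, mean, sd) triples.

    Returns (F ratio, eta squared), where eta squared is the share of
    total variation in effect sizes that lies between categories.
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_ratio, eta_squared


# Test-type rows of Table 3: local, local and standardized, standardized.
test_type = [(38, 0.73, 0.32), (3, 0.45, 0.24), (9, 0.13, 0.17)]
f_ratio, eta_squared = anova_from_summary(test_type)
print(f_ratio, eta_squared)
```

The weighted average of the three category means is about 0.61, in line with the Winsorized mean reported earlier, and a substantial share of the variation in ES lies between test-type categories. Note that the r in Table 3 is a correlation with the category code, which is not the same quantity as eta squared.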

Supplementary  Analyses 

We  carried  out  three  parallel  series  of  ANOVAs  with  the  following  estimators  of 
effect  magnitude  and  units  of  analysis:  (a)  Glass’s  ES  as  estimator  and  evaluation 
finding  as  unit  of  analysis,  (b)  Hedges’s  g  as  the  estimator  and  evaluation  study  as 
the  unit,  and  (c)  Hedges’s  g  as  the  estimator  and  evaluation  finding  as  the  unit.  We 
used  the  90%  Winsorized  data  set  in  each  of  the  analyses  to  keep  outlier  values 
from  having  an  inordinate  influence  on  results.  The  results  of  these  ANOVAs  are 
similar  to  the  results  in  Table  3.  Each  set  of  analyses  showed  that  test  type  was  the 
study  feature  most  strongly  related  to  size  of  effect,  and  each  found  that  the  five 
other  study  features  mentioned  above  were  strongly  related  to  size  of  effect. 

We also carried out two supplementary analyses of the 90% Winsorized data
set that used Hedges’s g as the estimator of size of effect and Hedges’s
homogeneity procedures as the analytic method. Evaluation study was the unit in
one analysis; evaluation finding was the unit in the other. Overall, these analyses
confirmed the main ANOVA findings. As in other analyses, test type was the study
feature most strongly related to study result. For example, in the homogeneity
analysis of evaluation study results, average g was 0.62 when outcomes were
measured on local tests, 0.09 when they were measured on standardized tests, and
0.46 when they were measured on both. In addition, all five of the other study
features that were significantly related to ES in the analyses of variance were
significantly related to g in these homogeneity analyses. However, the homogeneity
analyses also detected significant but smaller relationships between Hedges’s g
and five other study features, including the method of assigning participants to
treatment and control groups, the country in which the evaluation was conducted,
the year in which it was conducted, the duration of the evaluation in weeks, and
whether the evaluation report was published or not.

Key  Study  Features 

It  is  important  to  note  that  many  of  the  features  that  are  significantly  related  to 
size  of  effect  (Table  3)  are  highly  intercorrelated.  For  example,  standardized  tests 
were  used  almost  exclusively  in  large-scale  evaluations  of  Cognitive  Tutor 
Algebra  in  middle  schools  and  junior  high  schools  in  the  United  States,  and  as  a 
consequence,  test  type  is  highly  correlated  with  sample  size,  subject  taught,  and 
grade  level.  The  correlation  is  .60  between  test  type  and  sample  size,  .62  between 
test  type  and  subject  taught,  and  -.59  between  test  type  and  grade  level. 

A small number of underlying influences, perhaps a single factor, could easily
account for many of the significant relationships between study features and
size of effects in Table 3. To identify fundamental influences, we examined effects
not only for different categories of studies but also for different conditions within
studies. In addition, we examined findings in a few studies that could not be used
in our main meta-analysis. We found that at least three factors had a substantive
influence on evaluation findings: (a) the type of posttest used in a study, (b) the
type of control group in the study, and (c) the fidelity of the ITS implementation.

Test Type

This is the study feature that distinguished most clearly between studies with
large and small effects in both our primary and supplementary analyses. An early
study by Koedinger et al. (1997) sheds light on the way that test type can
influence evaluation results. The study examined effects of the Practical Algebra
Tutor, an early version of Cognitive Tutor, on two types of posttests: locally
developed tests that were aligned with the problem-solving objectives stressed in
the program and standardized multiple-choice tests that did not stress problem
solving. The researchers found large effects on the locally developed tests (mean
ES = 0.99) and smaller effects on the standardized ones (mean ES = 0.36). They
concluded that Practical Algebra Tutor was very effective in teaching the higher
order skills it was designed to teach and that it did not negatively affect
performance on standardized tests.

Later studies of Cognitive Tutor found the same pattern of results. For example,
Corbett (2001b, 2002) examined the effects of Cognitive Tutor both on locally
developed problem-solving tests and on multiple-choice tests consisting of
released questions on international, national, and state assessments. For Grade 7
students, effects were large on the locally developed problem-solving tests (mean
ES = 0.71) and trivial on the multiple-choice questions (mean ES = 0.18). For
Grade 8 students, effects were small (mean ES = 0.28) on local problem-solving
tests but even smaller on the multiple-choice questions (mean ES = 0.13).

The  pattern  holds  up  in  the  full  set  of  15  studies  of  Cognitive  Tutor  (Table  4). 
Overall,  Cognitive  Tutor  raised  student  performance  on  locally  developed  tests 
significantly  and  substantially  but  neither  helped  nor  hindered  student  performance 
on  standardized  tests.  The  mean  ES  on  the  standardized  tests  in  the  Cognitive  Tutor 
evaluations  is  0.12,  whereas  the  mean  ES  on  locally  developed  tests  is  0.76.  Median 
ES  on  standardized  tests  is  0.16;  median  ES  on  local  tests  is  0.86.  That  is,  Cognitive 
Tutor  boosted  performance  on  locally  developed  problem-solving  tests  that  were 
well  aligned  with  its  curricular  objectives,  but  it  did  not  boost  performance  on 
multiple-choice  standardized  tests  that  emphasized  recognition  skills. 
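
The summary figures for the Cognitive Tutor studies can be checked against the entries in Table 4. In the sketch below, the assignment of each value to the local or standardized column is inferred from the flattened table and from the means and medians quoted above, so it should be read as an assumption. The table entries are rounded to two decimals, which means the recomputed statistics can differ from the reported ones in the last digit.

```python
from statistics import mean, median

# Effect sizes on locally developed tests (8 studies in Table 4).
local = [1.00, 0.74, 0.71, 0.28, 1.00, 0.35, 0.99, 1.00]

# Effect sizes on standardized tests (10 studies in Table 4).
standardized = [0.03, -0.10, 0.18, 0.13, 0.36, -0.19, 0.20, 0.40, 0.22, -0.07]

print("local:        mean", mean(local), "median", median(local))
print("standardized: mean", mean(standardized), "median", median(standardized))
```

The recomputed values agree with the reported means of 0.76 and 0.12 and medians of 0.86 and 0.16 to within rounding.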

We  also  conducted  several  analyses  to  determine  whether  study  features  were 
related  to  size  of  effect  when  type  of  test  was  held  constant.  We  carried  out  these 
analyses  with  the  90%  Winsorized  sample  to  keep  outliers  from  having  an  undue 
influence  on  results.  We  found  that  study  features  were  not  related  to  size  of  effect 
with  test  type  held  constant.  There  were  no  significant  relationships  between 
study  features  and  effect  magnitude  in  the  38  evaluation  reports  that  measured 
outcomes  on  local  tests,  nor  were  there  any  in  the  9  evaluations  that  measured 
outcomes  on  standardized  tests.  This  was  true  whether  the  estimator  of  effect  size 
was  Glass’s  ES  or  Hedges’s  g.  It  also  made  no  difference  whether  standard 
ANOVAs,  correlations,  or  Hedges’s  homogeneity  procedures  were  used  to  study 
the  relationships. 

Control  Condition 

In addition to examining studies with conventional control groups, we examined
results in 6 reports, covering 11 separate experiments, with nonconventional
control groups (Table 5). The nonconventional control groups were of two types.
Control students in the first type of experiment read special materials that were

TABLE 4

Effects by test type in 15 Cognitive Tutor studies

                                                                    ES
Publication                                              Local  Standardized  Overall
Anderson, Boyle, Corbett, and Lewis (1990)               1.00                 1.00
Arbuckle (2005)                                          0.74                 0.74
Cabalo and Vu (2007)                                            0.03          0.03
Campuzano, Dynarski, Agodini, and Rall (2009)                   -0.10         -0.10
Corbett (2001b)                                          0.71   0.18          0.45
Corbett (2002)                                           0.28   0.13          0.21
Corbett and Anderson (2001)                              1.00                 1.00
Koedinger and Anderson (1993)                            0.35                 0.35
Koedinger, Anderson, Hadley, and Mark (1997)             0.99   0.36          0.68
Pane, McCaffrey, Slaughter, Steele, and Ikemoto (2010)          -0.19         -0.19
Pane, Griffin, McCaffrey, and Karam (2013)                      0.20          0.20
Reiser, Anderson, and Farrell (1985)                     1.00                 1.00
S. Ritter, Kulikowich, Lei, McGuire, and Morgan (2007)          0.40          0.40
Shneyderman (2001)                                              0.22          0.22
Smith (2001)                                                    -0.07         -0.07
Mdn                                                      0.86   0.16          0.35

Note. ES = Glass’s estimator of effect size.


derived from ITS computer interactions. The instructional material used by the
control group therefore overlapped with ITS material. Graesser et al. (2004)
referred to such control material as textbook-reduced; VanLehn et al. (2007) called
it canned text. Control students in the second type of experiment simply viewed
the recorded tutoring sessions of other students. The control students therefore
received the same explanations and feedback as ITS students did but only for
problems missed by paired, or yoked, students in the ITS group.

Effects are small in most of these studies. The strongest positive effect of
tutoring in the six reports is an increase in posttest scores of 0.50 standard
deviations; the largest negative effect is a reduction of 0.36 standard deviations
(ES = -0.36). The median of the six ESs is 0.24, and the mean is 0.18. The mean ES
is substantially lower than the mean ES (0.60) in evaluations with conventional
control groups. We carried out several supplementary analyses of the data that
varied the unit of analysis and the estimator of effect magnitude. The
supplementary analyses produced results that were similar to those in the primary
one.

TABLE 5

Descriptive information and ESs for six studies of intelligent tutoring systems without conventional control groups

Note. T = treatment group; C = control group; SBT = step-based tutoring; SSBT = substep-based tutoring; ES = Glass’s estimator of effect size.

It should be noted that all six of the studies with nonconventional control
conditions evaluated substep-based tutoring; none examined step-based tutoring.
Two variables are thus confounded in the six studies: type of control condition
and type of intelligent tutoring. Which of these is responsible for the depressed
ESs in these studies? The six studies by themselves do not provide an answer, but
we can answer the question by looking back at step-based and substep-based
studies with conventional control groups (see Table 3). The mean ES in 41 studies
of step-based tutoring with conventional control groups is 0.60, and the mean ES
in 9 studies of substep-based tutoring with conventional control groups is 0.63. It
therefore seems safe to conclude that the lower ESs in the six studies listed in
Table 5 are attributable to the control conditions in the studies, not the type of ITS
evaluated.

Implementation  Adequacy 

The  adequacy  of  intelligent  tutoring  implementations  also  affects  the  strength 
of  evaluation  findings.  Evidence  on  this  point  comes  from  four  studies  that 
reported  data  from  both  weaker  and  stronger  implementations  of  an  ITS.  The 
median  ES  for  the  stronger  implementations  is  0.44;  the  median  ES  for  the  weaker 
implementations  is  -0.01.  The  evaluators  who  carried  out  these  evaluations  did 
not  directly  manipulate  implementation  adequacy  in  their  studies.  The  variation  in 
implementation  adequacy  resulted  instead  from  technical  or  training  weaknesses 
that  affected  part  but  not  all  of  the  experiments.  The  evaluators  reported  results  in 
sufficient  detail  so  that  effects  of  the  weaker  and  stronger  parts  of  the  experiments 
could  be  contrasted. 

Koedinger  and  Anderson  (1993),  for  example,  compared  results  achieved  by 
an  experienced  ITS  teacher  with  results  achieved  by  two  teachers  who  were  new 
to  ITS  instruction.  In  the  hands  of  the  experienced  teacher,  the  ITS  improved  performance
0.96  standard  deviations.  In  the  hands  of  teachers  with  little  prior  experience
with  intelligent  tutoring,  the  ITS  had  a  negative  effect  on  student
performance;  ES  was  -0.23.  Teachers  with  limited  experience  treated  the  ITS  as  a 
replacement  for  the  teacher,  and  they  graded  papers  and  worked  on  similar  tasks 
while  the  students  were  working  on  the  computer.  The  experienced  teacher,  on  the 
other  hand,  thought  that  the  ITS  provided  an  opportunity  for  him  to  give  more 
individualized  help  to  students.  When  students  were  working  with  the  ITS,  he 
circulated  around  the  classroom  giving  extra  help  to  those  who  needed  it  and  challenging
other  students  with  additional  questions.  When  they  did  interact  with  students,
the  teachers  with  limited  experience  tended  to  focus  on  design  features  of
the  instructional  technology,  whereas  the  experienced  teacher  moved  students 
quickly  past  the  technology  interface  and  directed  their  attention  instead  to  the 
geometry  content. 

Le  et  al.  (2009)  examined  the  effects  of  a  single  1-hour  session  of  intelligent
tutoring  on  students’  logic  programming  skills.  The  intelligent  tutoring  session
was  held  on  two  separate  days.  On  the  first  day,  the  intelligent  tutoring  implementation
was  poor.  Technical  problems  created  long  delays  in  the  computer  tutor’s
responses  (e.g.,  1-minute  delays).  The  average  ES  for  intelligent  tutoring  on  the
first  day  was  0.01.  Technical  problems  were  resolved  by  the  second  day  of  the
experiment,  and  the  average  ES  for  intelligent  tutoring  rose  to  0.31.

Pane,  Griffin,  McCaffrey,  and  Karam  (2013)  found  significantly  different 
effects  during  the  first  and  second  years  of  an  implementation  of  Cognitive  Tutor 
Algebra  I.  Nearly  10,000  Algebra  I  students  were  included  in  the  evaluation  during
the  first  year  of  the  Cognitive  Tutor  program,  and  another  10,000  students
were  included  during  the  second  year.  Pane  et  al.  reported  that  Cognitive  Tutor 
had  no  significant  effect  on  student  test  scores  when  teachers  were  using  it  for  the 
first  time  (mean  ES  =  -0.06),  but  it  had  a  small  but  highly  significant  positive 
effect  when  teachers  used  it  for  a  second  time  (mean  ES  =  0.20). 

Finally,  VanLehn  et  al.  (2005)  reported  results  from  5  years  of  use  of  the  Andes 
tutoring  system  at  the  U.S.  Naval  Academy.  In  the  first  year,  the  Andes  system 
presented  students  with  relatively  few  physics  problems  and  the  program  contained
a  relatively  large  number  of  bugs.  In  the  first  year  of  the  program,  ES  for
hour  exams  was  0.21.  In  the  second  through  fifth  years  of  the  program,  the  number
of  physics  problems  was  increased,  and  bugs  were  fixed.  Average  ES  for  hour
exams  for  these  4  years  was  0.57.
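
The medians quoted at the start of this section can be roughly checked against the eight ESs given in the four study summaries above. The following Python sketch is an illustrative addition, not part of the original analysis. It uses only the rounded values quoted in the text, and the labels and variable names are assumptions chosen for readability.

```python
from statistics import median

# ESs as quoted in the text (rounded); labels are illustrative only.
stronger = {
    "Koedinger & Anderson (1993), experienced teacher": 0.96,
    "Le et al. (2009), second day": 0.31,
    "Pane et al. (2013), second year": 0.20,
    "VanLehn et al. (2005), later years": 0.57,
}
weaker = {
    "Koedinger & Anderson (1993), new teachers": -0.23,
    "Le et al. (2009), first day": 0.01,
    "Pane et al. (2013), first year": -0.06,
    "VanLehn et al. (2005), first year": 0.21,
}

# With four values, the median is the mean of the two middle values.
median_stronger = median(stronger.values())
median_weaker = median(weaker.values())

print(f"median ES, stronger implementations: {median_stronger:.2f}")
print(f"median ES, weaker implementations:   {median_weaker:.3f}")
```

With these rounded inputs, the median for the stronger implementations reproduces the reported 0.44. The median for the weaker implementations comes out at about -0.025 rather than the reported -0.01, a small gap that presumably reflects rounding or the use of unrounded study values in the original calculation.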

Discussion 

This  meta-analysis  shows  that  ITSs  can  be  very  effective  instructional  tools. 
Students  who  received  intelligent  tutoring  outperformed  students  from  conventional
classes  in  46  (or  92%)  of  the  50  controlled  evaluations,  and  the  improvement
in  performance  was  great  enough  to  be  considered  of  substantive  importance
in  39  (or  78%)  of  the  50  studies.  The  median  ES  in  the  50  studies  was  0.66,  which 
is  considered  a  moderate-to-large  effect  for  studies  in  the  social  sciences.  It  is 
roughly  equivalent  to  an  improvement  in  test  performance  from  the  50th  to  the 
75th  percentile.
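
The percentile equivalence can be illustrated with a short calculation. The sketch below is an illustrative addition and assumes that test scores are normally distributed, so that an ES expressed in standard deviation units can be mapped to a percentile with the standard normal distribution. The function name `percentile_after_gain` is hypothetical.

```python
from statistics import NormalDist


def percentile_after_gain(effect_size: float, start_percentile: float = 50.0) -> float:
    """Percentile reached by a student who starts at start_percentile and
    gains effect_size standard deviations (normal scores assumed)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    start_z = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(start_z + effect_size)


# Median ES from the 50 controlled evaluations.
its_percentile = percentile_after_gain(0.66)
print(f"ES 0.66: 50th -> {its_percentile:.1f}th percentile")
```

Under the normality assumption, an ES of 0.66 moves a student from the 50th percentile to roughly the 75th, and the same function puts a two-sigma gain at about the 98th percentile.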

This  is  stronger  than  typical  effects  from  other  forms  of  tutoring.  C.-L.  C. 
Kulik  and  Kulik’s  (1991)  meta-analysis,  for  example,  found  an  average  ES  of 
0.31  in  165  studies  of  CAI  tutoring.  ITS  gains  are  about  twice  as  high.  The  ITS 
effect  is  also  greater  than  typical  effects  from  human  tutoring.  As  we  have  seen, 
programs  of  human  tutoring  typically  raise  student  test  scores  about  0.4  standard
deviations  over  control  levels.  Developers  of  ITSs  long  ago  set  out  to
improve  on  the  success  of  CAI  tutoring  and  to  match  the  success  of  human 
tutoring.  Our  results  suggest  that  ITS  developers  have  already  met  both  of  these 
goals. 

ITS  effects  are  also  robust.  The  50  controlled  evaluations  we  reviewed  took 
place  at  different  times,  in  different  places,  and  in  different  educational  settings. 
Although  the  settings  were  diverse,  moderately  strong  ITS  effects  were  the  rule. 
For  example,  the  50  evaluations  in  our  meta-analysis  were  carried  out  in  nine 
countries  on  four  continents.  A  total  of  39  (or  78%)  of  the  studies  were  done  in  the 
United  States,  where  ITSs  were  first  developed,  and  11  (or  22%)  were  done  outside
the  United  States.  The  average  ES  found  in  studies  conducted  within  the
United  States  was  0.56;  the  average  ES  in  studies  conducted  outside  the  United 
States  was  0.79.  It  appears  therefore  that  ITSs  have  not  only  traveled  far  from 
their  country  of  origin  but  also  traveled  well.  They  appear  to  be  just  as  effective 
abroad  as  they  are  at  home. 

We  found  one  important  exception  to  the  rule  of  moderately  strong  positive
effects  in  the  50  controlled  evaluations.  Although  effects  were  moderate  to  strong 
in  evaluations  that  measured  outcomes  on  locally  developed  tests,  they  were 
much  smaller  in  evaluations  that  measured  outcomes  on  standardized  tests. 
Average  ES  on  studies  with  local  tests  was  0.73;  average  ES  on  studies  with  standardized
tests  was  0.13.  This  discrepancy  is  not  unusual  for  meta-analyses  that
include  both  local  and  standardized  tests.  A  meta-analysis  by  Rosenshine  and 
Meister  (1994),  for  example,  found  that  reciprocal  teaching  systems  raised  student
performance  0.88  standard  deviations  on  local  tests  but  only  0.32  standard
deviations  on  standardized  tests.  A  meta-analysis  by  C.-L.  C.  Kulik,  Kulik,  and 
Bangert-Drowns  (1990)  found  mastery  learning  systems  boosted  student  performance
by  0.57  standard  deviations  on  local  tests  but  by  only  0.29  standard  deviations
on  standardized  tests.

Which  kind  of  test  should  we  trust?  Both  local  and  standardized  tests  have 
their  champions.  Some  evaluators  prefer  local  tests,  because  local  tests  are  likely 
to  align  well  with  the  objectives  of  specific  instructional  programs.  Off-the-shelf 
standardized  tests  provide  a  looser  fit.  Evaluators  who  prefer  standardized  tests, 
on  the  other  hand,  usually  praise  them  for  being  free  of  bias.  Unlike  local  tests, 
which  may  be  written  by  developers  or  supporters  of  an  experimental  program, 
standardized  tests  are  almost  always  third-party  affairs.  The  authors  of  standardized
tests  can  hardly  slant  them  to  favor  one  group  or  another  in  future  evaluation
studies. 

Our  own  belief  is  that  both  local  and  standardized  tests  provide  important 
information  about  instructional  effectiveness,  and  when  possible,  both  types  of 
tests  should  be  included  in  evaluation  studies.  We  think  that  Koedinger  et  al. 
(1997)  were  on  the  right  track  when  they  included  both  standardized  and  local 
tests  in  their  pioneering  ITS  evaluation.  They  found  strong  ITS  effects  on  local 
tests  that  were  aligned  with  the  curriculum  and  smaller  effects  on  standardized 
tests  that  were  not.  The  ITS  thus  improved  the  problem-solving  skills  it  was 
designed  to  teach,  and  the  improvement  in  problem  solving  came  at  no  cost  to  the 
recognition  skills  emphasized  on  standardized  tests.  We  suspect  that  the  same 
conclusion  may  be  appropriate  for  ITSs  in  general.  Only  the  wider  use  of  both 
standardized  and  local  tests  in  ITS  evaluations  will  provide  conclusive  evidence. 

Another  factor  that  affects  ITS  evaluation  results  is  the  type  of  control  group 
used  in  a  study.  Specifically,  results  are  different  for  studies  with  conventional  and 
nonconventional  control  groups.  Median  ES  is  0.66  in  studies  with  conventional 
control  groups.  Median  ES  is  0.28  in  studies  with  nonconventional  control  groups 
that  were  taught  with  materials  derived  from  the  ITS  interactions.  Studies  with 
nonconventional  control  groups  can  be  useful  in  determining  how  ITSs  work,  but 
they  do  not  give  a  useful  answer  to  the  question  of  overall  ITS  effectiveness. 

A  third  factor  that  can  influence  results  of  an  intelligent  tutoring  program  is  the 
adequacy  of  the  program  implementation.  Very  few  ITS  evaluations  measured 
implementation  adequacy  directly,  but  four  studies  suggested  that  intelligent 
tutoring  effects  are  stronger  when  programs  are  carefully  implemented  and  weaker 
when  programs  are  not  implemented  expertly  or  when  technical  problems  affect 
implementations.  It  is  not  clear  whether  implementation  adequacy  affected  other 
studies  in  our  data  set  beyond  these  four.  On  the  one  hand,  we  did  not  include  in 
our  main  analyses  findings  from  implementations  with  reported  inadequacies,  so
the  effect  might  be  small.  On  the  other  hand,  ITSs  were  a  novelty  to  teachers  in 
some  large  studies  included  in  our  analyses,  and  the  teachers’  limited  experience
with  ITSs  may  have  affected  results  in  their  classrooms. 

Our  meta-analytic  findings  shed  light  on  some  otherwise  puzzling  conclusions 
reached  in  other  reviews  of  ITS  findings.  Reviews  of  Cognitive  Tutor  evaluations, 
for  example,  have  drawn  contradictory  conclusions  about  its  effectiveness.  Early 
reviews  reported  strong  improvements  in  student  performance  due  to  Cognitive 
Tutor  (e.g.,  Corbett  et  al.,  1997),  but  recent  reviews  have  reported  that  Cognitive 
Tutor  has  little  or  no  consistent  effect  on  student  learning  (e.g.,  Slavin  et  al.,  2009; 
U.S.  Department  of  Education,  Institute  of  Education  Sciences,  What  Works 
Clearinghouse,  2013).  We  found  that  review  findings  depend  on  the  proportion  of 
reviewed  studies  that  used  locally  developed  tests.  Early  reviews,  which  reported 
strong  positive  improvements,  based  their  conclusions  entirely  on  findings  from 
local  tests.  Recent  reviews  that  reported  little  or  no  positive  improvements  from 
Cognitive  Tutor  based  their  conclusions  entirely  on  results  from  standardized 
tests.  We  found  a  median  ES  of  0.86  on  local  tests  used  in  Cognitive  Tutor  evaluations,
a  median  ES  of  0.16  on  standardized  tests,  and  a  median  ES  of  0.35  for  all
tests  used  in  Cognitive  Tutor  evaluations. 

Our  analysis  also  sheds  light  on  an  unexpected  finding  in  VanLehn’s  (2011) 
review  on  tutoring  effects.  Specifically,  VanLehn  found  an  average  size  of  effect  of 
0.76  for  an  older  and  less  exacting  form  of  ITS,  which  he  called  step-based  tutoring.
He  found  an  average  size  of  effect  of  only  0.40  for  substep-based  ITSs,  a
newer  and  more  rigorous  approach.  We  found  similar  effects  for  step-based  and 
substep-based  ITSs  in  studies  with  conventional  control  groups.  However,  we 
found  smaller  effects  in  studies  of  substep-based  tutoring  with  nonconventional 
control  groups.  We  excluded  studies  with  nonconventional  control  groups  from  our 
meta-analysis,  but  VanLehn  included  them  in  his  analyses.  The  low  average  size  of 
effect  that  he  reported  for  substep-based  tutoring  thus  seems  to  be  due  more  to  the 
type  of  control  groups  in  VanLehn’s  studies  than  to  substep-based  tutoring  itself. 

Our  findings  are  clearly  different  from  those  of  Steenbergen-Hu  and  Cooper 
(2013),  who  reported  that  ITSs  had  no  real  effect  on  K-12  math  performance. 
They  found  an  average  effect  of  about  0.05  standard  deviations  in  the  26  studies 
included  in  their  meta-analysis.  In  contrast,  we  found  an  average  ES  of  0.40  in  18 
studies  of  ITS  effectiveness  in  elementary  and  high  school  mathematics.  The 
average  ES  was  0.72  in  seven  studies  that  measured  outcomes  on  local  tests,  0.45 
in  three  studies  that  measured  outcomes  on  both  standardized  and  local  tests,  and 
0.10  in  eight  studies  that  measured  outcomes  only  on  standardized  tests. 
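
The three subgroup averages are consistent with the overall figure, as a quick study-count-weighted mean shows. The sketch below is an illustrative addition. It assumes that every study contributes equally to the overall average and uses only the rounded values quoted above, and the variable names are hypothetical.

```python
# (number of studies, average ES) for each test type, as quoted in the text
subgroups = {
    "local tests only": (7, 0.72),
    "both standardized and local tests": (3, 0.45),
    "standardized tests only": (8, 0.10),
}

# Weight each subgroup average by its number of studies (assumption:
# every study counts equally toward the overall mean).
total_studies = sum(n for n, _ in subgroups.values())
weighted_mean_es = sum(n * es for n, es in subgroups.values()) / total_studies

print(f"{total_studies} studies, weighted mean ES = {weighted_mean_es:.2f}")
```

The weighted mean across the 18 studies comes out at about 0.40, matching the reported average, and the breakdown shows that the overall figure depends heavily on how many studies relied on standardized tests alone.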

No  single  factor  is  responsible  for  the  difference  in  findings  of  our  meta-analysis
and  Steenbergen-Hu  and  Cooper’s  (2013),  but  it  is  important  to  note  that  the
two  meta-analyses  defined  ITSs  differently.  Steenbergen-Hu  and  Cooper  defined 
ITSs  as  “self-paced,  learner-led,  highly  adaptive,  and  interactive  learning  environments
operated  through  computers”  (p.  983).  This  broad  definition  led  them  to
include  in  their  meta-analysis  a  number  of  computer  systems  that  are  not  ordinarily
considered  to  be  ITSs.  Specifically,  their  meta-analysis  included  evaluations  of
such  CAI  systems  as  iLearnMath,  Larson  Pre-Algebra,  Larson  Algebra,  Plato 
Algebra,  Plato  Achieve  Now,  and  an  online  remediation  system  used  in  a  study  by 
Biesinger  and  Crippen  (2008).  These  systems  are  not  classified  as  ITSs  by  the
developers  of  the  systems,  and  they  would  not  be  considered  to  be  ITSs  by  most 
experts  on  intelligent  tutoring.  To  use  VanLehn’s  terminology,  these  systems  are 
answer-based  CAI  tutors.  They  can  provide  feedback  on  student  answers  but  not 
on  the  thinking  that  goes  into  individual  answers.  We  therefore  excluded  evaluations
of  these  and  other  CAI  systems  from  our  meta-analysis.

It  is  also  important  to  note  that  Steenbergen-Hu  and  Cooper  (2013)  had  looser 
requirements  than  we  did  for  acceptable  control  groups,  and  they  included  in  their 
meta-analysis  a  number  of  evaluations  without  adequate  control  groups.  For 
example,  their  meta-analysis  included  evaluations  by  Beal,  Walles,  Arroyo,  and 
Woolf  (2007);  Plano  (2004);  and  Walles  (2005)  in  which  treatment  and  control 
groups  differed  substantially  in  pretest  scores.  The  difference  was  equivalent  to 
0.81  standard  deviations  in  Beal’s  study,  1.09  standard  deviations  in  Plano’s,  and 
0.76  standard  deviations  in  Walles’s.  Also  included  in  Steenbergen-Hu  and 
Cooper’s  review  were  studies  with  no-instruction  controls  (Beal,  Arroyo,  Cohen, 
&  Woolf,  2010;  Biesinger  &  Crippen,  2008;  Radwan,  1997)  and  studies  that  provided
no  evidence  of  baseline  equivalence  of  groups  (Carnegie  Learning  Inc.,
2001;  Corbett,  2002;  Koedinger,  2002;  Sarkis,  2004).  We  excluded  these  studies 
from  our  analysis,  because  they  did  not  appear  to  provide  a  fair  baseline  for 
assessing  the  contributions  that  ITSs  might  make. 

Overall,  the  message  from  what  we  judge  to  be  fair  comparisons  of  ITS  and 
conventional  instruction  seems  clear.  The  evaluations  show  that  ITSs  typically 
raise  student  performance  well  beyond  the  level  of  conventional  classes  and 
even  beyond  the  level  achieved  by  students  who  receive  instruction  from  other 
forms  of  computer  tutoring  or  from  human  tutors.  Although  a  small  minority  of 
ITS  studies  found  no  significant  difference  in  performance  of  ITS  and  control 
students,  most  of  these  studies  were  weak  in  design  or  execution.  Some  measured
outcomes  solely  on  off-the-shelf  tests  that  were  poorly  aligned  with  the
higher  order  curricular  objectives  emphasized  in  ITS  programs.  Other  studies 
used  nonconventional  control  groups  that  studied  special  materials  that  were 
derived  from  ITS  interactions.  Still  other  studies  suffered  from  poorly  implemented
ITS  treatments.  When  results  from  such  questionable  comparisons  are
left  out  of  the  mix,  the  message  from  ITS  evaluations  is  clear,  consistent,  and 
positive. 

It  is  hard  to  predict  the  exact  shape  that  computer  tutoring  will  take  in  the 
future.  In  effect,  we  may  be  at  the  “wireless  telegraph”  phase,  with  radio  yet  to  be 
developed.  Advances  are  surely  coming  on  a  number  of  fronts — in  computer 
hardware,  software,  networking,  and  cognitive  science — and  these  advances  will 
likely  affect  both  the  appearance  and  structure  of  future  tutoring  systems.  It 
remains  to  be  seen  whether  tomorrow’s  computer  tutors  will  produce  the  two-sigma
improvements  that  have  so  far  eluded  most  ITS  developers,  but  the  available
evidence  suggests  that  today’s  ITSs  can  serve  as  a  sound  foundation  for
future  work. 


Note 

The  research  was  carried  out  with  support  from  the  Office  of  Naval  Research. 